Joshua Koonce

Credit Card Users Churn Prediction

Feature Selection, Model Selection, and Tuning

The objective of this project is to identify which features influence a customer's decision to stop using their credit card and to build a model that predicts customer churn with a high degree of accuracy.

Various credit cards charge an array of different fees, which are a source of income for banks but may also be contributing factors to customer churn.

Various models will be built and tuned for optimality using Random Search cross-validation approaches. The data will also be under- and over-sampled to see whether resampling affects model performance.

The columns in the dataset are as follows:

Import staple data science libraries

Import the dataset

The dataset has 10,127 rows and 21 columns to start with.

Look at the data types for all columns

We have missing values in Education_Level and Marital_Status only.

Since these are categorical, rather than impute values and risk introducing bad data, we will drop the rows with missing values; they make up a small enough percentage of the data that the loss is acceptable.
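The drop can be done with pandas' `dropna` restricted to the two affected columns. A minimal sketch on a stand-in frame (the real notebook loads the full dataset; the toy values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the real dataset.
df = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
    "Marital_Status": ["Married", "Single", np.nan, "Single"],
    "Customer_Age": [45, 38, 51, 29],
})

# Drop only rows missing in the two categorical columns rather than imputing.
df_clean = df.dropna(subset=["Education_Level", "Marital_Status"]).reset_index(drop=True)
print(df_clean.shape)  # two rows had a missing value, so (2, 3)
```

Restricting `subset` to the affected columns avoids accidentally dropping rows for missing values elsewhere.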

Data Pre-Processing and Cleaning

Includes cleaning from problems found later in the analysis

Printing the columns makes it easier to copy/paste into the cleaning code

Quick check of the ordinal categoricals:

Check type and missing values again.

The DataFrame is looking a lot cleaner now, with no missing values and data types corrected. 7,081 rows remain.

Quick look at Categorical variables and their unique values to ensure we don't have any oddities:

Perform an Exploratory Data Analysis and Provide Insights

Most of the variables are fairly normally distributed. Age, Number of Trips, and Income appear to be clearly right-skewed.

Univariate Distributions EDA

Let's get the distributions of each numeric variable. Many are discrete.

Plots of Individual Variables in Relation to Attrition_Flag

Insights based on Variables vs. Attrition:

Bi-variate and Univariate Distribution (PairPlot)

The pairplot highlights some interesting relationships, but mostly confirms the bivariate analysis.

There appear to be very strong clusters with respect to churn. I expect a strong signal-to-noise ratio for the machine learning models to differentiate on.

Model Building

For this business problem, I will use the Recall score as my primary metric of interest.

In my opinion, we need to positively identify customers who are likely to churn so that we can see where our credit program is least successful. We can focus less on those classes of customers, or we can modify our credit program to suit the needs of that class of customer. Recall measures the true positives identified out of total actual positives; in this case, a "positive" is a churned customer.

Note: Model performance will be commented on after all models have been built rather than after each model.

Establish Confusion Matrix Metrics function definition.
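A helper of this kind typically unpacks the confusion matrix and derives the key classification metrics from it. A sketch (the function name and return format are assumptions, not the notebook's exact definition):

```python
from sklearn.metrics import confusion_matrix

def confusion_matrix_metrics(y_true, y_pred):
    """Derive accuracy, recall, precision, and F1 from the confusion
    matrix (positive class = churn = 1)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

scores = confusion_matrix_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(scores["recall"])  # 2 of 3 actual positives found -> ~0.667
```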

Create dataframes to store the model score results:

In order to model, I want to create dummy variables for the categorical variables:
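One-hot encoding via `pd.get_dummies` handles this. A sketch on stand-in columns (the column names here are plausible examples from a credit card dataset, not the notebook's full list):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "Card_Category": ["Blue", "Silver", "Blue", "Gold"],
    "Credit_Limit": [4000.0, 12000.0, 3500.0, 20000.0],
})

# drop_first=True drops the redundant reference level of each categorical,
# which avoids perfect multicollinearity in linear models.
df_encoded = pd.get_dummies(df, columns=["Gender", "Card_Category"], drop_first=True)
print(sorted(df_encoded.columns))
```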

Features and Targets: Train/Test Split

Lastly, I want to split the dataset into features and targets before I get to model building. These will get used for naive k-fold cross-validation.
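The split is a standard stratified `train_test_split`; stratifying keeps the churn rate equal across train and test. A sketch on synthetic data (the test size and random seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))              # stand-in feature matrix
y = (rng.random(100) < 0.16).astype(int)   # ~16% churn, mimicking imbalance

# Stratify so train and test keep roughly the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)  # (70, 4) (30, 4)
```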

The data is ready to feed into the modeling process.

Regularization: Scaling the Data
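The key discipline when scaling is to fit the scaler on the training data only and reuse it on the test data, so no information leaks from the test set. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 500.0]])

# Fit on training data only, then apply the SAME transform to test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0))  # each training column centered near 0
```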

Logistic Regression

Fit the model:

Fit Model using KFold Cross Validation and Check Metrics:
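The k-fold evaluation pattern used here can be sketched with `cross_val_score` scoring on Recall. Synthetic data stands in for the churn dataset; the fold count and seed are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features/target (~16% positive class).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.84],
                           random_state=7)

# Scaling lives inside the pipeline so each fold is scaled independently.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
recall_scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
print(recall_scores.mean())
```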

Decision Tree

Create the model:

Fit Model using KFold Cross Validation and Check Metrics:

Bagging Classifier (Bagged Tree)

Create the model:

Fit Model using KFold Cross Validation and Check Metrics:

Random Forest Classifier

Create the model:

Fit Model using KFold Cross Validation and Check Metrics:

AdaBoost Classifier

Import additional classifiers needed for boosted approaches.

Create the model:

Fit Model using KFold Cross Validation and Check Metrics:

Gradient Boost Classifier

Create the model:

Fit Model using KFold Cross Validation and Check Metrics:
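The same create-then-cross-validate pattern repeats for each tree-based model above, so it can be sketched as one loop over default-parameter classifiers (defaults and seeds here are assumptions; the notebook may configure them differently):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (~16% positive class).
X, y = make_classification(n_samples=400, n_features=8, weights=[0.84],
                           random_state=7)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boost": GradientBoostingClassifier(random_state=1),
}
# Mean 5-fold Recall for each untuned model.
results = {name: cross_val_score(m, X, y, cv=5, scoring="recall").mean()
           for name, m in models.items()}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```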

XGBoost Classifier

Fit the model:

Fit Model using KFold Cross Validation and Check Metrics:

Model Performance Evaluation (Untuned Models, Cross Validation)

I'll tune the top two models with Random Search, then tune the Random Forest model as a third so that the tuned models aren't all boosted approaches.

Tuned Random Forest

Fit the model (cross-validating within training set):

Find the best model and report the final parameters:
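The Random Search tuning pattern can be sketched with `RandomizedSearchCV`. The search space below is a hypothetical example, not the notebook's actual grid:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the churn training data.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.84],
                           random_state=7)

# Hypothetical search space -- the notebook's actual ranges may differ.
param_dist = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(3, 12),
    "min_samples_leaf": randint(1, 10),
}
# Cross-validation happens inside the training set; scoring on Recall.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1), param_dist,
    n_iter=10, cv=3, scoring="recall", random_state=1)
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` is the refit model, ready for predictions on the held-out test set.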

Make Predictions:

Check Metrics (Against Unseen Test Set)

Tuned AdaBoost

Fit the model (cross-validating within training set):

Make Predictions:

Check Metrics (Against Unseen Test Set)

Tuned XGBoost

Fit the model (cross-validating within training set):

Make Predictions:

Check Metrics (Against Unseen Test Set):

Model Performance Evaluation (Tuned Models)

Model Building: Undersampling and Oversampling

Import libraries needed for resampling:

Model Building: Oversampling using SMOTE

Create the oversampled dataset:

Oversampled Tuned Random Forest Model

Fit the Model:

Make Predictions:

Check Metrics (Against Unseen Test Data):

Oversampled Tuned AdaBoost

Make Predictions:

Check Metrics (Against Unseen Test Data):

Oversampled Tuned XGBoost

Fit the model:

Make Predictions:

Check Metrics (Against Unseen Test Data):

Model Building: Undersampling

Create the undersampled data:

Undersampled Tuned Random Forest Model:

Fit the model:

Make Predictions:

Check Metrics (Against Unseen Test Data)

Undersampled Tuned AdaBoost Model:

Fit the model:

Make Predictions:

Check Metrics (Against Unseen Test Data):

Undersampled Tuned XGBoost Model:

Fit the model:

Make Predictions:

Check Metrics (Against Unseen Test Data):

Model Performance

Model performance sorted by Recall:

Productionize the Model: Create and Utilize a Pipeline

The above work has been replicated in a pipeline for the best model.

Actionable Insights and Recommendations

Given the extremely high performance of the Undersampled Tuned XGBoost churn prediction model, I would recommend the company use this model to continually screen customer profiles and flag those that are at high risk of leaving the company's services. It achieves over 95% Recall, and its accuracy is also outstanding.

The company can then target these customers with incentives or other marketing efforts in order to retain them as customers and preserve the revenue they generate.

Alternatively, the company could use this model to segment customers into hard-to-retain groups and potentially divest the services that have high turnover, focusing on its low-churn services instead.

Many customers who churn appear to carry little or no balance, so the company may also be able to incentivize card usage by offering perks such as 0% interest on qualifying purchases for a limited period.